Find this repository: https://github.com/libjohn/workshop_textmining
Much of this review comes from the site: https://juliasilge.github.io/tidytext/
The tidytext package enables all kinds of text mining. See also this helpful free online book: Text Mining with R: A Tidy Approach by Silge and Robinson
library(janeaustenr)
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tidytext)
library(wordcloud2)
Data
We’ll look at some books by Jane Austen, a novelist of the late 18th and early 19th centuries. Austen explored women and marriage within the British upper class. The novelist has a unique and well-earned following within literature. Her work is consistently discussed and honored, and to this day Austen’s novels are the source of many adaptations, written and on-screen. Through the janeaustenr package we can access and mine the text of six Austen novels. A collection of texts such as this is called a corpus (plural: corpora); an individual novel is a document within the corpus.
austen_books()
## # A tibble: 73,422 x 2
## text book
## * <chr> <fct>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility
## 2 "" Sense & Sensibility
## 3 "by Jane Austen" Sense & Sensibility
## 4 "" Sense & Sensibility
## 5 "(1811)" Sense & Sensibility
## 6 "" Sense & Sensibility
## 7 "" Sense & Sensibility
## 8 "" Sense & Sensibility
## 9 "" Sense & Sensibility
## 10 "CHAPTER 1" Sense & Sensibility
## # ... with 73,412 more rows
Austen is best known for six published works:
austen_books() %>%
distinct(book)
## # A tibble: 6 x 1
## book
## <fct>
## 1 Sense & Sensibility
## 2 Pride & Prejudice
## 3 Mansfield Park
## 4 Emma
## 5 Northanger Abbey
## 6 Persuasion
Data Cleaning
Text mining typically requires a lot of data cleaning. In this case, we start with the janeaustenr collection, which has already been cleaned. Nonetheless, further data wrangling is required. First, we identify a line number for each line of text in each book.
Identify line numbers
original_books <- austen_books() %>%
group_by(book) %>%
mutate(line = row_number()) %>% # identify line numbers
ungroup()
original_books
## # A tibble: 73,422 x 3
## text book line
## <chr> <fct> <int>
## 1 "SENSE AND SENSIBILITY" Sense & Sensibility 1
## 2 "" Sense & Sensibility 2
## 3 "by Jane Austen" Sense & Sensibility 3
## 4 "" Sense & Sensibility 4
## 5 "(1811)" Sense & Sensibility 5
## 6 "" Sense & Sensibility 6
## 7 "" Sense & Sensibility 7
## 8 "" Sense & Sensibility 8
## 9 "" Sense & Sensibility 9
## 10 "CHAPTER 1" Sense & Sensibility 10
## # ... with 73,412 more rows
Tokens
To work with these data as a tidy dataset, we need to restructure the data through tokenization. In our case a token is a single word. We want one-token-per-row. The unnest_tokens() function (tidytext package) will convert a data frame with a text column into the one-token-per-row format.
Token and tokenization, defined
The default tokenizing mode is “words”. With the unnest_tokens() function, tokens can be: words, characters, character_shingles, ngrams, skip_ngrams, sentences, lines, paragraphs, regex, tweets, and ptb (Penn Treebank).
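To make the default "words" mode concrete, here is a rough base-R sketch of the idea: lowercase the text, strip punctuation, and split on whitespace. (tidytext actually delegates to the tokenizers package, which handles many more cases; the function name here is made up for illustration.)

```r
# A rough base-R imitation of unnest_tokens()'s default "words" mode:
# lowercase, strip punctuation, split on whitespace.
tokenize_words_sketch <- function(text) {
  cleaned <- gsub("[[:punct:]]", "", tolower(text))
  unlist(strsplit(trimws(cleaned), "\\s+"))
}

tokenize_words_sketch("SENSE AND SENSIBILITY, by Jane Austen.")
# "sense" "and" "sensibility" "by" "jane" "austen"
```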
Process
- Group by line number (above)
- Make each single word a token
tidy_books <- original_books %>%
unnest_tokens(word, text)
tidy_books
## # A tibble: 725,055 x 3
## book line word
## <fct> <int> <chr>
## 1 Sense & Sensibility 1 sense
## 2 Sense & Sensibility 1 and
## 3 Sense & Sensibility 1 sensibility
## 4 Sense & Sensibility 3 by
## 5 Sense & Sensibility 3 jane
## 6 Sense & Sensibility 3 austen
## 7 Sense & Sensibility 5 1811
## 8 Sense & Sensibility 10 chapter
## 9 Sense & Sensibility 10 1
## 10 Sense & Sensibility 13 the
## # ... with 725,045 more rows
Now that the data is in the one-word-per-row format, we can manipulate it with tidy tools like dplyr.
Stop Words
tidytext::get_stopwords()
Remove stop-words from the books.
matchwords_books <- tidy_books %>%
anti_join(get_stopwords())
## Joining, by = "word"
matchwords_books
## # A tibble: 325,084 x 3
## book line word
## <fct> <int> <chr>
## 1 Sense & Sensibility 1 sense
## 2 Sense & Sensibility 1 sensibility
## 3 Sense & Sensibility 3 jane
## 4 Sense & Sensibility 3 austen
## 5 Sense & Sensibility 5 1811
## 6 Sense & Sensibility 10 chapter
## 7 Sense & Sensibility 10 1
## 8 Sense & Sensibility 13 family
## 9 Sense & Sensibility 13 dashwood
## 10 Sense & Sensibility 13 long
## # ... with 325,074 more rows
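anti_join() keeps only the rows of tidy_books whose word has no match in the stop-word table. The same filtering idea can be sketched in base R with plain character vectors (the vectors here are made up for illustration):

```r
# anti_join() keeps rows with no match in the second table.
# The same idea with plain vectors:
words <- c("sense", "and", "sensibility", "by", "jane", "austen")
stop_words <- c("and", "by", "the", "of")

words[!words %in% stop_words]
# "sense" "sensibility" "jane" "austen"
```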
Join types
Customize your dictionaries
You can customize stop-words data frames, sentiment data frames, etc.
There are various stop-word dictionaries. Here we add the stop word “farfegnugen” to a custom dictionary. If Jane Austen ever used the word “farfegnugen”, that would be strange. So we will make sure we never calculate the sentiment of that word, whether or not the term shows up in a sentiment dictionary. That is, we will remove the word by adding it to a customized stop-words dictionary.
stopwords::stopwords_getsources()
## [1] "snowball" "stopwords-iso" "misc" "smart"
## [5] "marimo" "ancient" "nltk" "perseus"
stopwords::stopwords_getlanguages("snowball")
## [1] "da" "de" "en" "es" "fi" "fr" "hu" "ir" "it" "nl" "no" "pt" "ro" "ru" "sv"
stopwords_custom <- tribble(~word, ~lexicon,
"farfegnugen", "custom")
stopwords_custom
## # A tibble: 1 x 2
## word lexicon
## <chr> <chr>
## 1 farfegnugen custom
get_stopwords(source = "snowball")
## # A tibble: 175 x 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # ... with 165 more rows
bind_rows(get_stopwords(), stopwords_custom) # The default source is "snowball"
## # A tibble: 176 x 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # ... with 166 more rows
Calculate word frequency
How many distinct words remain in the Austen corpus after removing the snowball stop words? There are 14,375.
matchwords_books %>%
# distinct(word)
count(word, sort = TRUE)
## # A tibble: 14,375 x 2
## word n
## <chr> <int>
## 1 mr 3015
## 2 mrs 2446
## 3 must 2071
## 4 said 2041
## 5 much 1935
## 6 miss 1855
## 7 one 1831
## 8 well 1523
## 9 every 1456
## 10 think 1440
## # ... with 14,365 more rows
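count(word, sort = TRUE) tallies one row per distinct word and sorts by frequency. In base R the same tally looks like this (toy vector for illustration):

```r
# count(word, sort = TRUE) is a tally of distinct values, sorted by frequency.
# Base-R equivalent on a toy vector:
words <- c("mr", "mrs", "mr", "said", "mr", "mrs")
freq <- sort(table(words), decreasing = TRUE)
freq
# mr appears 3 times, mrs 2 times, said 1 time
```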
Word clouds
matchwords_books %>%
count(word, sort = TRUE) %>%
head(100) %>%
wordcloud2(size = .4, shape = 'triangle-forward',
color = c("steelblue", "firebrick", "darkorchid"),
backgroundColor = "salmon")
Basic word cloud
A non-interactive word cloud.
matchwords_books %>%
count(word) %>%
with(wordcloud::wordcloud(word, n, max.words = 100))
Your Turn: Exercise 1
Goal: Make a basic word cloud for the novel Pride and Prejudice, pride_prej_novel
- Prepare

pride_prej_novel <- tibble(text = prideprejudice) %>%
  mutate(line = row_number())

- Tokenize pride_prej_novel with unnest_tokens()
- Remove stop-words
- Calculate word frequency
- Make a simple word cloud
Sentiment Analysis
get_sentiments()
Let’s see what positive words exist in the bing dictionary. Then, count the frequency of those positive words that exist in Emma.
positive <- get_sentiments("bing") %>%
filter(sentiment == "positive") # get POSITIVE words
positive
## # A tibble: 2,005 x 2
## word sentiment
## <chr> <chr>
## 1 abound positive
## 2 abounds positive
## 3 abundance positive
## 4 abundant positive
## 5 accessable positive
## 6 accessible positive
## 7 acclaim positive
## 8 acclaimed positive
## 9 acclamation positive
## 10 accolade positive
## # ... with 1,995 more rows
tidy_books %>%
filter(book == "Emma") %>% # only the book _emma_
semi_join(positive) %>% # semi_join()
count(word, sort = TRUE)
## Joining, by = "word"
## # A tibble: 668 x 2
## word n
## <chr> <int>
## 1 well 401
## 2 good 359
## 3 great 264
## 4 like 200
## 5 better 173
## 6 enough 129
## 7 happy 125
## 8 love 117
## 9 pleasure 115
## 10 right 92
## # ... with 658 more rows
Prepare to visualize sentiment score
Match all the Austen books to the bing sentiment dictionary. Count the word frequency.
tidy_books %>%
inner_join(get_sentiments("bing")) %>%
count(book)
## Joining, by = "word"
## # A tibble: 6 x 2
## book n
## <fct> <int>
## 1 Sense & Sensibility 8604
## 2 Pride & Prejudice 8704
## 3 Mansfield Park 11577
## 4 Emma 11966
## 5 Northanger Abbey 5762
## 6 Persuasion 5674
Calculate sentiment
Algorithm: sentiment = positive - negative
Define a section of text.
"Small sections of text may not have enough words in them to get a good estimate of sentiment while really large sections can wash out narrative structure. For these books, using 80 lines works well, but this can vary depending on individual texts…" – Text Mining with R
bing <- get_sentiments("bing")
janeaustensentiment <- tidy_books %>%
inner_join(bing) %>%
count(book, index = line %/% 80, sentiment) %>% # `%/%` = int division ; 80 lines / section
pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>% # spread(sentiment, n, fill = 0)
mutate(sentiment = positive - negative) # ALGO!!!
## Joining, by = "word"
janeaustensentiment
## # A tibble: 920 x 5
## book index negative positive sentiment
## <fct> <dbl> <int> <int> <int>
## 1 Sense & Sensibility 0 16 32 16
## 2 Sense & Sensibility 1 19 53 34
## 3 Sense & Sensibility 2 12 31 19
## 4 Sense & Sensibility 3 15 31 16
## 5 Sense & Sensibility 4 16 34 18
## 6 Sense & Sensibility 5 16 51 35
## 7 Sense & Sensibility 6 24 40 16
## 8 Sense & Sensibility 7 23 51 28
## 9 Sense & Sensibility 8 30 40 10
## 10 Sense & Sensibility 9 15 19 4
## # ... with 910 more rows
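The index column above comes from integer division: line %/% 80 assigns every run of 80 consecutive lines the same section number, and the score is simply positive minus negative. A minimal numeric illustration (the positive/negative counts are the first two sections from the output above):

```r
# `%/%` is integer division: every 80 consecutive lines share one index.
line <- c(1, 79, 80, 159, 160)
line %/% 80
# 0 0 1 1 2

# The sentiment score is just positive count minus negative count per section.
positive <- c(32, 53)
negative <- c(16, 19)
positive - negative
# 16 34
```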
Viz it
janeaustensentiment %>%
ggplot(aes(index, sentiment)) +
geom_col(show.legend = FALSE, fill = "cadetblue") +
geom_col(data = . %>% filter(sentiment < 0), show.legend = FALSE, fill = "firebrick") +
geom_hline(yintercept = 0, color = "goldenrod") +
facet_wrap(~ book, ncol = 2, scales = "free_x")
Preparation: Most common positive and negative words
bing_word_counts <- tidy_books %>%
inner_join(bing) %>%
count(word, sentiment, sort = TRUE)
## Joining, by = "word"
bing_word_counts
## # A tibble: 2,585 x 3
## word sentiment n
## <chr> <chr> <int>
## 1 miss negative 1855
## 2 well positive 1523
## 3 good positive 1380
## 4 great positive 981
## 5 like positive 725
## 6 better positive 639
## 7 enough positive 613
## 8 happy positive 534
## 9 love positive 495
## 10 pleasure positive 462
## # ... with 2,575 more rows
Viz it too
bing_word_counts %>%
filter(n > 170) %>%
mutate(n = if_else(sentiment == "negative", - n, n)) %>%
ggplot(aes(fct_reorder(str_to_title(word), n), n, fill = str_to_title(sentiment))) +
geom_col() +
coord_flip() +
scale_fill_brewer(type = "qual") +
guides(fill = guide_legend(reverse = TRUE)) +
labs(title = "Frequency of popular positive and negative words",
subtitle = "Jane Austen novels",
y = "Compound sentiment score", x = "",
fill = "Sentiment", caption = "Source: library(janeaustenr)") +
theme(plot.title.position = "plot")
Dictionaries
What other dictionaries are available? How to choose?
- Without dictionaries there is no sentiment analysis
- Sentiment Analysis: Analyzing Lexicon Quality and Estimation Errors
- Limits of the Bing, AFINN, and NRC Lexicons with the Tidytext Package in R
- Case Study with Harry Potter
head(get_sentiments("bing"))
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 2-faces negative
## 2 abnormal negative
## 3 abolish negative
## 4 abominable negative
## 5 abominably negative
## 6 abominate negative
head(get_sentiments("loughran"))
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 abandon negative
## 2 abandoned negative
## 3 abandoning negative
## 4 abandonment negative
## 5 abandonments negative
## 6 abandons negative
head(get_sentiments("nrc"))
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
head(get_sentiments("afinn"))
## # A tibble: 6 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
get_sentiments("nrc") %>%
count(sentiment, sort = TRUE)
## # A tibble: 10 x 2
## sentiment n
## <chr> <int>
## 1 negative 3324
## 2 positive 2312
## 3 fear 1476
## 4 anger 1247
## 5 trust 1231
## 6 sadness 1191
## 7 disgust 1058
## 8 anticipation 839
## 9 joy 689
## 10 surprise 534
Afinn
What words in Emma match the AFINN dictionary?
emma_afinn <- tidy_books %>%
filter(book == "Emma") %>%
anti_join(get_stopwords()) %>%
inner_join(get_sentiments("afinn"))
## Joining, by = "word"
## Joining, by = "word"
emma_afinn
## # A tibble: 10,159 x 4
## book line word value
## <fct> <int> <chr> <dbl>
## 1 Emma 15 clever 2
## 2 Emma 15 rich 2
## 3 Emma 15 comfortable 2
## 4 Emma 16 happy 3
## 5 Emma 16 best 3
## 6 Emma 18 distress -2
## 7 Emma 20 affectionate 3
## 8 Emma 22 died -3
## 9 Emma 24 excellent 3
## 10 Emma 25 fallen -2
## # ... with 10,149 more rows
emma_afinn %>%
count(word, sort = TRUE)
## # A tibble: 894 x 2
## word n
## <chr> <int>
## 1 miss 599
## 2 good 359
## 3 great 264
## 4 dear 241
## 5 like 200
## 6 better 173
## 7 hope 143
## 8 poor 136
## 9 wish 135
## 10 happy 125
## # ... with 884 more rows
Make Sections
Just as we sectioned the text by line above, make sections of 80 words, then sum the AFINN values within each section.
emma_afinn_sentiment <- emma_afinn %>%
mutate(word_count = 1:n(),
index = word_count %/% 80) %>%
group_by(index) %>%
summarise(sentiment = sum(value)) ## ALGO sum each Afinn score in the 80-word section
## `summarise()` ungrouping output (override with `.groups` argument)
emma_afinn_sentiment
## # A tibble: 127 x 2
## index sentiment
## <dbl> <dbl>
## 1 0 40
## 2 1 33
## 3 2 77
## 4 3 84
## 5 4 52
## 6 5 80
## 7 6 98
## 8 7 80
## 9 8 69
## 10 9 68
## # ... with 117 more rows
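The pattern in the chunk above — number each word, integer-divide into sections, sum the values per section — can be sketched in base R with tapply(). Sections of 3 words are used here just to keep the example small (the chunk above uses 80), and the AFINN values are made up:

```r
# Number each word, bin into sections with `%/%`, sum the scores per bin.
value <- c(2, 2, 2, 3, 3, -2)     # made-up Afinn scores, in text order
index <- seq_along(value) %/% 3   # word_count %/% section_size
tapply(value, index, sum)
# sums: 4 (index 0), 8 (index 1), -2 (index 2)
```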
Viz it
emma_afinn %>%
mutate(word_count = 1:n(),
index = word_count %/% 80) %>%
filter(index == 104) %>%
count(word, sort = TRUE) %>%
wordcloud2(size = .4, shape = 'diamond',
backgroundColor = "darkseagreen")
emma_afinn_sentiment %>%
ggplot(aes(index, sentiment)) +
geom_col(aes(fill = cut_interval(sentiment, n = 5))) +
geom_hline(yintercept = 0, color = "forestgreen", linetype = "dashed") +
scale_fill_brewer(palette = "RdBu", guide = FALSE) +
theme(panel.background = element_rect(fill = "grey"),
plot.background = element_rect(fill = "grey"),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()) +
labs(title = "Afinn Sentiment Analysis of _Emma_")
emma_afinn %>%
mutate(word_count = 1:n(),
index = as.character(word_count %/% 80)) %>%
filter(index == 10 | index == 104 | index == 105) %>%
ggplot(aes(value, index)) +
geom_boxplot() +
# geom_boxplot(notch = TRUE) +
geom_jitter() +
coord_flip() +
labs(y = "section", x = "Afinn")
Resources
- Tidytext package
- Book: Text Mining with R by Silge and Robinson
- Data Wrangling with dplyr: (video | workshop)
- Data Visualization with ggplot2: (video | workshop)
John Little
Rfun
Center for Data & Visualization Sciences
CC BY-NC
Creative Commons: Attribution, Non-commercial
https://creativecommons.org/licenses/by-nc/4.0/